This notebook is a first look at the train set of the Kaggle competition "Google Landmark Recognition 2020".
# imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import cv2
# load the train data csv file as data frame
train_df = pd.read_csv("C:/Users/Matan/Desktop/projectB/data/train/train.csv")
train_df
| | id | landmark_id |
|---|---|---|
| 0 | 17660ef415d37059 | 1 |
| 1 | 92b6290d571448f6 | 1 |
| 2 | cd41bf948edc0340 | 1 |
| 3 | fb09f1e98c6d2f70 | 1 |
| 4 | 25c9dfc7ea69838d | 7 |
| ... | ... | ... |
| 1580465 | 72c3b1c367e3d559 | 203092 |
| 1580466 | 7a6a2d9ea92684a6 | 203092 |
| 1580467 | 9401fad4c497e1f9 | 203092 |
| 1580468 | aacc960c9a228b5f | 203092 |
| 1580469 | d9e338c530dca106 | 203092 |
1580470 rows × 2 columns
We would like to look at some of the images from the train set:
# load train image paths and labels into a dictionary, then convert it to a dataframe
train_path_label_dict = {'image': [], 'target': []}
for i in range(train_df.shape[0]):
    train_path_label_dict['image'].append(
        "D:/dataset/train" + '/' +
        train_df['id'][i][0] + '/' +
        train_df['id'][i][1] + '/' +
        train_df['id'][i][2] + '/' +
        train_df['id'][i] + ".jpg")
    train_path_label_dict['target'].append(train_df['landmark_id'][i])
train_path_label_df = pd.DataFrame(train_path_label_dict)
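The loop above can also be written with vectorized pandas string operations, which is considerably faster over 1.5M rows. A minimal sketch on a hypothetical two-row sample (the real notebook builds the paths under D:/dataset/train):

```python
import pandas as pd

# hypothetical two-row sample standing in for the full train.csv
train_df = pd.DataFrame({
    "id": ["17660ef415d37059", "92b6290d571448f6"],
    "landmark_id": [1, 1],
})

root = "D:/dataset/train"
ids = train_df["id"]
# each image lives in nested folders named after the first three characters of its id
paths = root + "/" + ids.str[0] + "/" + ids.str[1] + "/" + ids.str[2] + "/" + ids + ".jpg"
train_path_label_df = pd.DataFrame({"image": paths, "target": train_df["landmark_id"]})
print(train_path_label_df.image.iloc[0])
# D:/dataset/train/1/7/6/17660ef415d37059.jpg
```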
images = []
for i in range(111, 131):
    img = cv2.imread(train_path_label_df.image[i])  # index directly; the [111:131] slice was redundant
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    images.append(img)
f, ax = plt.subplots(4, 5, figsize=(15, 15))
for i, img in enumerate(images):
    ax[i//5, i%5].imshow(img)
    ax[i//5, i%5].axis('off')
As this small sample shows, the images of the train set differ in size and color. They also depict very different landmarks, and some show not the landmark itself but its interior.
Some of the train set properties:
print("number of ids is: {}" .format(train_df['id'].size))
print("\nnumber of unique values in landmark_id column is: {}" .format(train_df['landmark_id'].nunique()))
number of ids is: 1580470

number of unique values in landmark_id column is: 81313
As we can see, the data set is composed of 1,580,470 images divided into 81,313 classes. The sheer number of images and classes makes this a really challenging data set.
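Beyond the raw totals, `value_counts()` gives a quick summary of how unevenly the images are spread over the classes. A sketch on hypothetical labels (on the real data the same calls would run on `train_df['landmark_id']`):

```python
import pandas as pd

# hypothetical landmark_id column: one dominant class and a few tiny ones
labels = pd.Series([138982] * 6 + [126637] * 3 + [110417] * 2 + [59905] * 2)

counts = labels.value_counts()
print(counts.describe())  # per-class image counts: mean, min, max, quartiles
print("classes:", counts.size, "images:", int(counts.sum()))
```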
The train set histogram:
ax = train_df['landmark_id'].plot.hist(bins=81313, grid=False, rwidth=0.9, figsize=(12, 8))  # plot.hist opens its own figure, so no separate plt.figure() call
ax.set_xlabel("Class Number", labelpad=20, weight='bold', size=12)
ax.set_ylabel("Number of Objects", labelpad=20, weight='bold', size=12)
ax.set_title("Histogram of Objects Distribution")
plt.show()
And zoomed-in:
ax1 = train_df['landmark_id'].plot.hist(bins=81313, grid=False, rwidth=0.9, figsize=(12, 8))
ax1.set_xlabel("Class Number", labelpad=20, weight='bold', size=12)
ax1.set_ylabel("Number of Objects", labelpad=20, weight='bold', size=12)
ax1.set_title('Histogram of Objects Distribution - Zoomed In')
plt.axis([138982-150, 138982+150, 0, 6400])  # zoom in on the class with the most items
plt.show()
As the histogram of the train set shows, there is huge variation in the number of objects per class, so the train set distribution is long-tailed.
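One way to quantify the long tail is the cumulative share of images covered by the largest classes; in a long-tailed set, a handful of classes covers most of the images. A sketch on hypothetical per-class counts:

```python
import pandas as pd

# hypothetical per-class counts, already sorted descending like value_counts() output
counts = pd.Series([6272, 2231, 1758, 9, 5, 2, 2])

share = counts.cumsum() / counts.sum()   # cumulative fraction of all images
print(share.round(3).tolist())           # the single largest class covers over 60% here
```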
We would like to look more carefully at the top and bottom classes in the data set.
We'll start with the top classes:
print("top 10 classes:\n")
print(train_df['landmark_id'].value_counts().head(10))
top 10 classes:

138982    6272
126637    2231
20409     1758
83144     1741
113209    1135
177870    1088
194914    1073
149980     971
139894     966
1924       944
Name: landmark_id, dtype: int64
We will look at some of the images from the top 10 classes:
top10 = train_df['landmark_id'].value_counts().head(10).index
images = []
for i in range(10):
    img = cv2.imread(train_path_label_df[train_path_label_df.target == top10[i]]['image'].values[7])
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    images.append(img)
f, ax = plt.subplots(5, 2, figsize=(20, 15))
for i, img in enumerate(images):
    ax[i//2, i%2].imshow(img)
    ax[i//2, i%2].axis('off')
fig2 = plt.figure(figsize = (12,8))
sns.countplot(x=train_df.landmark_id, order = train_df['landmark_id'].value_counts().head(10).index)
plt.xlabel("Class Number")
plt.ylabel("Number of Objects")
plt.title("Top 10 Classes in the Train Set")
plt.show()
As the table and the graph show, class 138982 is the biggest class by a wide margin: it contains 6,272 objects, while none of the other top classes contains more than 2,500.
We will look now at 12 images from the top 5 classes:
top5 = train_df['landmark_id'].value_counts().head(5).index
for i in range(5):
    images = []
    for j in range(12):
        img = cv2.imread(train_path_label_df[train_path_label_df.target == top5[i]]['image'].values[j])
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        images.append(img)
    fig, ax = plt.subplots(3, 4, figsize=(20, 15))
    fig.suptitle("\n\n\nclass {}".format(top5[i]), fontsize=16)
    for k, img in enumerate(images):
        ax[k//4, k%4].imshow(img)
        ax[k//4, k%4].axis('off')
As the images above show, the differences between images within the same class can be large, and in some cases it is not clear why such different images share a class. This is another major challenge of the data set: large intra-class variability.
We'll look now at the top 50 classes:
top50 = train_df['landmark_id'].value_counts().head(50).index
images = []
for i in range(50):
    img = cv2.imread(train_path_label_df[train_path_label_df.target == top50[i]]['image'].values[12])
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    images.append(img)
f, ax = plt.subplots(10, 5, figsize=(30, 30))
for i, img in enumerate(images):
    ax[i//5, i%5].imshow(img)
    ax[i//5, i%5].axis('off')
fig3 = plt.figure(figsize = (12,8))
sns.countplot(x=train_df.landmark_id, order = train_df['landmark_id'].value_counts().head(50).index)
plt.xlabel("Class Number")
plt.ylabel("Number of Objects")
plt.title("Top 50 Classes in the Train Set")
plt.xticks(rotation = 90)
plt.show()
This graph illustrates one of the major challenges of the given train set: the long-tailed distribution, which is already pronounced over just 50 classes (out of more than 80K).
We'll look now at the bottom classes. We'll start from the bottom 10:
print("bottom 10 classes:\n")
print(train_df['landmark_id'].value_counts().tail(10))
bottom 10 classes:

110417    2
59905     2
4171      2
73532     2
195143    2
180503    2
179834    2
183115    2
63266     2
197219    2
Name: landmark_id, dtype: int64
bottom10 = train_df['landmark_id'].value_counts().tail(10).index
images = []
for i in range(10):
    img = cv2.imread(train_path_label_df[train_path_label_df.target == bottom10[i]]['image'].values[1])
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    images.append(img)
f, ax = plt.subplots(5, 2, figsize=(20, 15))
for i, img in enumerate(images):
    ax[i//2, i%2].imshow(img)
    ax[i//2, i%2].axis('off')
fig4 = plt.figure(figsize = (12,8))
sns.countplot(x=train_df.landmark_id, order = train_df['landmark_id'].value_counts().tail(10).index)
plt.xlabel("Class Number")
plt.ylabel("Number of Objects")
plt.title("Bottom 10 Classes in the Train Set")
plt.show()
As we can see, all 10 bottom classes contain only 2 objects each.
bottom50 = train_df['landmark_id'].value_counts().tail(50).index
images = []
for i in range(50):
    img = cv2.imread(train_path_label_df[train_path_label_df.target == bottom50[i]]['image'].values[0])
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    images.append(img)
f, ax = plt.subplots(10, 5, figsize=(30, 30))
for i, img in enumerate(images):
    ax[i//5, i%5].imshow(img)
    ax[i//5, i%5].axis('off')
fig5 = plt.figure(figsize = (12,8))
sns.countplot(x = train_df.landmark_id, order = train_df['landmark_id'].value_counts().tail(50).index)
plt.xlabel("Class Number")
plt.ylabel("Number of Objects")
plt.title("Bottom 50 Classes in the Train Set")
plt.xticks(rotation = 90)
plt.show()
The bottom 50 classes, like the bottom 10, contain only 2 objects each.
In fact, more than half of the classes have 9 objects or fewer:
# half of the 81,313 classes is 40,656.5, so the bottom 40,660 cover more than half
print(train_df['landmark_id'].value_counts().tail(40660))
74931 9
146017 9
105350 9
49287 9
70133 9
..
180503 2
179834 2
183115 2
63266 2
197219 2
Name: landmark_id, Length: 40660, dtype: int64
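The same claim can be checked without hand-picking the tail length, by counting the classes whose frequency falls at or below a threshold. A sketch on hypothetical labels (on the real data, replace `labels` with `train_df['landmark_id']`):

```python
import pandas as pd

# hypothetical labels: two small classes and one large class
labels = pd.Series([1] * 2 + [2] * 3 + [3] * 10)

counts = labels.value_counts()
small = int((counts <= 9).sum())   # number of classes with 9 images or fewer
print(small, "of", counts.size, "classes have <= 9 images")
# 2 of 3 classes have <= 9 images
```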
This is another aspect of the long-tailed distribution: many classes contain very few objects, which will make the training process more difficult and challenging.
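One common way to soften such imbalance during training is inverse-frequency class weighting, so that rare classes contribute more to the loss. A hedged sketch (one option among several, not part of this notebook's pipeline), shown on hypothetical labels:

```python
import pandas as pd

# hypothetical labels; on the real data this would be train_df['landmark_id']
labels = pd.Series([0, 0, 0, 0, 1, 1, 2])

counts = labels.value_counts()
# weight_c = N / (n_classes * count_c): the rarest class gets the largest weight
weights = counts.sum() / (counts.size * counts)
print(weights.sort_index().to_dict())
```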
This is the train set we will use to train our algorithm.